R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.3.2 fastmap_1.1.1 cli_3.6.2
[5] tools_4.3.2 htmltools_0.5.7 rstudioapi_0.15.0 yaml_2.3.8
[9] rmarkdown_2.25 knitr_1.45 jsonlite_1.8.8 xfun_0.41
[13] digest_0.6.33 rlang_1.1.2 evaluate_0.23
Q1. Git/GitHub
No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).
Create a private repository biostat-203b-2024-winter and add Hua-Zhou and TA team (Tomoki-Okuno for Lec 1; jonathanhori and jasenzhang1 for Lec 80) as your collaborators with write permission.
Top directories of the repository should be hw1, hw2, … Maintain two branches, main and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The main branch will be your presentation area. Submit your homework files (Quarto file qmd, html file converted by Quarto, all code and extra data sets to reproduce results) in the main branch.
After each homework due date, the course reader and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.
After this course, you can make this repository public and use it to demonstrate your skill sets on the job market.
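The branch-and-tag workflow described above can be sketched as follows. This is only an illustration in a throwaway local repository: the file name hw1.qmd, the commit messages, and the user name and email are placeholders, and pushing to GitHub is left out because no remote exists here. An annotated tag (git tag -a) is used because it records when the tag was created.

```shell
# Sketch only: a throwaway repository in a temp directory; file names and
# commit messages are placeholders, not course requirements.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" symbolic-ref HEAD refs/heads/main   # name the first branch main
git -C "$repo" config user.name "Student"
git -C "$repo" config user.email "student@example.com"

# First commit on main so that the branch exists.
mkdir "$repo/hw1"
echo "# Homework 1" > "$repo/hw1/hw1.qmd"
git -C "$repo" add hw1
git -C "$repo" commit -q -m "hw1: add report skeleton"

# Do the day-to-day work on develop.
git -C "$repo" checkout -q -b develop
echo "Q1 answer" >> "$repo/hw1/hw1.qmd"
git -C "$repo" commit -q -a -m "hw1: answer Q1"

# When the homework is ready, bring main up to date and tag the submission.
git -C "$repo" checkout -q main
git -C "$repo" merge -q develop
git -C "$repo" tag -a hw1 -m "hw1 submission"

branches=$(git -C "$repo" branch --format='%(refname:short)' | sort | paste -sd' ' -)
tags=$(git -C "$repo" tag --list)
echo "branches: $branches"
echo "tags: $tags"
```

In a real repository with a GitHub remote, the tag would also have to be pushed (for example with git push origin hw1) before graders can see it.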
This exercise (and later in this course) uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. You must complete Q2 before working on the remaining questions. (Hint: The CITI training takes a few hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)
I also obtained the PhysioNet credential for using the MIMIC-IV data. Here is the screenshot of my PhysioNet credential.
Q3. Linux Shell Commands
Make the MIMIC v2.2 data available at location ~/mimic.
ls -l ~/mimic/
Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of the data files. Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessarily big files and are not big-data-friendly practices. Read from the data folder ~/mimic directly in the following exercises.
Use Bash commands to answer the following questions.
Answer: I created a symbolic link mimic to my MIMIC data folder. Here is the output of ls -l ~/mimic/:
ls -l ~/mimic/
total 48
-rw-rw-r--@ 1 zhangjiyin staff 13332 Jan 5 2023 CHANGELOG.txt
-rw-rw-r--@ 1 zhangjiyin staff 2518 Jan 5 2023 LICENSE.txt
-rw-rw-r--@ 1 zhangjiyin staff 2884 Jan 6 2023 SHA256SUMS.txt
drwxr-xr-x@ 24 zhangjiyin staff 768 Jan 5 23:41 hosp
drwxr-xr-x@ 11 zhangjiyin staff 352 Jan 5 23:41 icu
lrwxr-xr-x 1 zhangjiyin staff 61 Jan 24 22:46 mimic-iv-2.2 -> /Users/zhangjiyin/Desktop/ucla/23-24/winter/203B/mimic-iv-2.2
Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
Answer:
Here is the output of ls -l ~/mimic/hosp/:
ls -l ~/mimic/hosp/
total 8859752
-rw-rw-r--@ 1 zhangjiyin staff 15516088 Jan 5 2023 admissions.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 427468 Jan 5 2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 859438 Jan 5 2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 578517 Jan 5 2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 12900 Jan 5 2023 d_labitems.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 25070720 Jan 5 2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 7426955 Jan 5 2023 drgcodes.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 508524623 Jan 5 2023 emar.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 471096030 Jan 5 2023 emar_detail.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 1767138 Jan 5 2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 1939088924 Jan 5 2023 labevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 96698496 Jan 5 2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 36124944 Jan 5 2023 omr.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 2312631 Jan 5 2023 patients.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 398753125 Jan 5 2023 pharmacy.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 498505135 Jan 5 2023 poe.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 25477219 Jan 5 2023 poe_detail.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 458817415 Jan 5 2023 prescriptions.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 6027067 Jan 5 2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 122507 Jan 5 2023 provider.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 6781247 Jan 5 2023 services.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 36158338 Jan 5 2023 transfers.csv.gz
Here is the output of ls -l ~/mimic/icu/:
ls -l ~/mimic/icu/
total 6155968
-rw-rw-r--@ 1 zhangjiyin staff 35893 Jan 5 2023 caregiver.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 2467761053 Jan 5 2023 chartevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 57476 Jan 5 2023 d_items.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 45721062 Jan 5 2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 2614571 Jan 5 2023 icustays.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 251962313 Jan 5 2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 324218488 Jan 5 2023 inputevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 38747895 Jan 5 2023 outputevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin staff 20717852 Jan 5 2023 procedureevents.csv.gz
Gzip compression reduces the size of the files, making them smaller and more efficient for storage and transmission. The gzip compression is lossless, meaning that the decompressed data is identical to the original data. Also, users can download and access compressed files more quickly than their uncompressed counterparts.
Hosp: The Hosp module provides all data acquired from the hospital wide electronic health record. Information covered includes patient and admission information, laboratory measurements, microbiology, medication administration, and billed diagnoses.
ICU: The ICU module contains information collected from the clinical information system used within the ICU. Documented data includes intravenous administrations, ventilator settings, and other charted items.
ED: The ED module contains data for emergency department patients collected while they are in the ED. Information includes reason for admission, triage assessment, vital signs, and medicine reconciliation. The subject_id and hadm_id identifiers allow MIMIC-IV-ED to be linked to other MIMIC-IV modules.
CXR: The CXR module provides lookup tables linking patient identifiers with MIMIC-CXR study_id and dicom_id, allowing analysis of patient chest x-rays to be linked with the clinical data from other MIMIC-IV modules.
Note: (NOT PUBLICLY AVAILABLE): The Note module contains deidentified free-text clinical notes for hospitalized patients.
Briefly describe what Bash commands zcat, zless, zmore, and zgrep do. Answer:
zcat: zcat writes the decompressed contents of one or more compressed files to standard output without changing the compressed files on disk. It is equivalent to gzip -cd.
zless: zless is used to view the contents of a compressed file one screen at a time. It is equivalent to gzip -cd | less. less is an improved version of more with additional features. It allows both forward and backward navigation through the file. You can use the arrow keys, Page Up, Page Down, and other keys for navigation. Press ‘q’ to exit. less supports searching, highlighting, and can display line numbers.
zmore: zmore is used to view the contents of a compressed file one screen at a time. It is equivalent to gzip -cd | more. You can press the spacebar to advance to the next screen, and press the Enter key to move one line at a time.
zgrep: zgrep is used to search through one or more compressed files for a string of characters that matches a specified pattern. It is equivalent to gzip -cd | grep.
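A small, non-interactive sketch of the two stream-oriented commands above. The file toy.csv.gz and its contents are made up for illustration and are not part of MIMIC-IV. zless and zmore are left out because they are interactive pagers.

```shell
# Toy data only: a tiny gzip file in a temp directory stands in for a real .csv.gz file.
tmpdir=$(mktemp -d)
printf 'id,name\n1,alpha\n2,beta\n3,alpha\n' | gzip > "$tmpdir/toy.csv.gz"

# zcat streams the decompressed text to stdout; the .gz file on disk is untouched.
# Reading from stdin (zcat < file) also works with the macOS version of zcat.
first_line=$(zcat < "$tmpdir/toy.csv.gz" | head -n 1)

# zgrep searches the compressed file directly; -c counts the matching lines.
alpha_lines=$(zgrep -c 'alpha' "$tmpdir/toy.csv.gz")

echo "header: $first_line"
echo "lines matching alpha: $alpha_lines"
```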
(Looping in Bash) What’s the output of the following bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done
Display the number of lines in each data file using a similar loop. (Hint: combine linux commands zcat < and wc -l.)
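One possible shape for such a loop, shown on two made-up gzip files in a temp directory rather than on the MIMIC data. The file names and contents are placeholders; with the real data the glob would point at ~/mimic/hosp/ instead.

```shell
# Toy data only: two small gzip files stand in for the real data files.
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' | gzip > "$tmpdir/admissions.csv.gz"
printf 'x\ny\n' | gzip > "$tmpdir/patients.csv.gz"

line_counts=""
for datafile in "$tmpdir"/*.gz
do
  # Decompress to a pipe and count lines; tr strips the padding some wc versions add.
  n=$(zcat < "$datafile" | wc -l | tr -d ' ')
  echo "$(basename "$datafile"): $n"
  line_counts="$line_counts$(basename "$datafile"):$n "
done
```

The count includes the header line of each file, so the number of data rows is one less.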
Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)
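A minimal sketch of the row count and unique-ID count on made-up data. The toy file below only mimics the layout assumed here (a header line, with subject_id in the first column). Splitting on commas with awk -F, is a simplification that would miscount if a field contained a quoted comma.

```shell
# Toy data only: four admissions belonging to three hypothetical patients.
tmpdir=$(mktemp -d)
printf 'subject_id,hadm_id\n10,100\n10,101\n11,102\n12,103\n' \
  | gzip > "$tmpdir/toy_admissions.csv.gz"

# Number of data rows: skip the header with tail -n +2, then count lines.
n_rows=$(zcat < "$tmpdir/toy_admissions.csv.gz" | tail -n +2 | wc -l | tr -d ' ')

# Number of unique IDs: take column 1, sort and de-duplicate, then count lines.
n_unique=$(zcat < "$tmpdir/toy_admissions.csv.gz" | tail -n +2 \
  | awk -F, '{print $1}' | sort -u | wc -l | tr -d ' ')

echo "rows: $n_rows, unique subject_id: $n_unique"
```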
What are the possible values taken by each of the variable admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)
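The tabulation can be sketched with sort and uniq -c. The data below is invented, and the column position (2) is an assumption of this toy file; in a real file the position of each variable should first be read off the header line.

```shell
# Toy data only: a hypothetical two-column file with a categorical second column.
tmpdir=$(mktemp -d)
printf 'subject_id,admission_type\n10,URGENT\n11,ELECTIVE\n12,URGENT\n13,URGENT\n' \
  | gzip > "$tmpdir/toy_admissions.csv.gz"

# Skip the header, extract column 2, then count each distinct value.
# uniq -c only merges adjacent duplicates, so the values must be sorted first.
zcat < "$tmpdir/toy_admissions.csv.gz" | tail -n +2 \
  | awk -F, '{print $2}' | sort | uniq -c | sort -nr

# The most frequent value and its count, kept in a variable for reuse.
top=$(zcat < "$tmpdir/toy_admissions.csv.gz" | tail -n +2 \
  | awk -F, '{print $2}' | sort | uniq -c | sort -nr | head -n 1 \
  | awk '{print $2, $1}')
echo "most frequent: $top"
```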
To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)
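The comparison can be rehearsed on a small synthetic file before touching the real data. Everything below is a sketch: the 20000 generated rows are placeholders, and the timings on such a small file only show the mechanics, not the actual cost on labevents.csv.gz. The time keyword used here is the one built into Bash.

```shell
# Toy data only: 20000 synthetic rows stand in for a large file such as labevents.csv.
tmpdir=$(mktemp -d)
seq 1 20000 | awk '{print $1 ",toy_item,50.0"}' > "$tmpdir/toy.csv"
gzip -c "$tmpdir/toy.csv" > "$tmpdir/toy.csv.gz"   # -c leaves the original file in place

# Storage: compare file sizes in bytes.
raw_bytes=$(wc -c < "$tmpdir/toy.csv" | tr -d ' ')
gz_bytes=$(wc -c < "$tmpdir/toy.csv.gz" | tr -d ' ')
echo "uncompressed: $raw_bytes bytes, compressed: $gz_bytes bytes"

# Speed: the compressed file has to be decompressed on every read.
time zcat < "$tmpdir/toy.csv.gz" | wc -l
time wc -l < "$tmpdir/toy.csv"

# Both routes should report the same number of lines.
n_gz=$(zcat < "$tmpdir/toy.csv.gz" | wc -l | tr -d ' ')
n_raw=$(wc -l < "$tmpdir/toy.csv" | tr -d ' ')

rm -r "$tmpdir"   # clean up, as the exercise asks for the large file
```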
Q4. Who’s popular in Pride and Prejudice
You and your friend just have finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save to your local folder.
Explain what wget -nc does. Do not put this text file pg42671.txt in Git. Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands.
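Answer: wget -nc (--no-clobber) downloads the file from the URL only if no file with the same name already exists in the current directory; otherwise it skips the download and keeps the existing copy.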
for char in Elizabeth Jane Lydia Darcy
do
  echo $char:
  # some bash commands here
  grep -o -i $char pg42671.txt | wc -l
done
It shows that Elizabeth was the most mentioned: 634 times, versus 418 for Darcy, 293 for Jane, and 171 for Lydia. The -i option of grep makes the search case-insensitive. The -o option prints each match on its own line, so a line that mentions a name twice is counted twice. The -l option of wc prints the number of lines in its input.
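To illustrate why -o matters for counting mentions, here is a small sketch on three made-up sentences (not text from the novel). grep -o piped to wc -l counts every occurrence, while grep -c counts only the lines that contain at least one match.

```shell
# Toy text only: the second line mentions the name twice.
tmpfile=$(mktemp)
printf 'Elizabeth met Darcy.\nElizabeth and elizabeth again.\nNo names here.\n' > "$tmpfile"

# Every occurrence, one per output line, then counted.
mentions=$(grep -o -i 'Elizabeth' "$tmpfile" | wc -l | tr -d ' ')

# Lines containing at least one occurrence.
lines_with=$(grep -c -i 'Elizabeth' "$tmpfile")

echo "mentions: $mentions, lines with a mention: $lines_with"
```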
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Answer: The first command (>) writes the text to test1.txt, creating the file if it does not exist and overwriting its contents if it does. The second command (>>) appends the text to the end of test2.txt, creating the file if it does not exist.
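The difference shows up when each command is run twice. A quick sketch in a temp directory, so that no existing file is touched:

```shell
tmpdir=$(mktemp -d)

# Run each redirection twice.
echo 'hello, world' > "$tmpdir/test1.txt"
echo 'hello, world' > "$tmpdir/test1.txt"    # overwrites: still one line
echo 'hello, world' >> "$tmpdir/test2.txt"
echo 'hello, world' >> "$tmpdir/test2.txt"   # appends: now two lines

n1=$(wc -l < "$tmpdir/test1.txt" | tr -d ' ')
n2=$(wc -l < "$tmpdir/test2.txt" | tr -d ' ')
echo "test1.txt has $n1 line(s), test2.txt has $n2 line(s)"
```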
Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Using chmod to make the file executable by the owner, and run
./middle.sh pg42671.txt 20 5
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
Answer: The output is the 5 lines from line 16 to line 20 of the file pg42671.txt. "$1" is the first argument of the shell script, the file name (pg42671.txt). "$2" is the second argument, the end line (20). "$3" is the third argument, the number of lines (5). The script therefore selects lines from the middle of the file: head first selects the first 20 lines of pg42671.txt and passes them to tail, which then keeps the last 5 of those lines.
The #! symbol is called a shebang or hashbang. It is a special character sequence that appears at the beginning of a script or an executable file in Unix-like operating systems. The shebang is followed by the path to the interpreter that should be used to execute the script. Therefore, the first line of the shell script tells the system to use the Bourne shell (/bin/sh) as the interpreter for executing the script.
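The behaviour is easiest to check on a file whose line numbers are visible. The sketch below writes the same script to a temp directory and runs it on a made-up file numbers.txt containing the integers 1 to 30, so the selected lines can be read off directly.

```shell
tmpdir=$(mktemp -d)

# Write the script from the exercise to a temp location.
cat > "$tmpdir/middle.sh" <<'EOF'
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

# Make it executable by the owner only.
chmod u+x "$tmpdir/middle.sh"

# Toy input: line k contains the number k.
seq 1 30 > "$tmpdir/numbers.txt"

# end_line = 20, num_lines = 5; paste joins the output lines with spaces.
middle=$("$tmpdir/middle.sh" "$tmpdir/numbers.txt" 20 5 | paste -sd' ' -)
echo "$middle"
```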
Q5. More fun with Linux
Try following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.
Answer: Here is the output of the commands:
cal
    January 2024
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
cal: display the calendar of the current month.
cal 2024
2024
January February March
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 1 2 3 1 2
7 8 9 10 11 12 13 4 5 6 7 8 9 10 3 4 5 6 7 8 9
14 15 16 17 18 19 20 11 12 13 14 15 16 17 10 11 12 13 14 15 16
21 22 23 24 25 26 27 18 19 20 21 22 23 24 17 18 19 20 21 22 23
28 29 30 31 25 26 27 28 29 24 25 26 27 28 29 30
31
April May June
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 1 2 3 4 1
7 8 9 10 11 12 13 5 6 7 8 9 10 11 2 3 4 5 6 7 8
14 15 16 17 18 19 20 12 13 14 15 16 17 18 9 10 11 12 13 14 15
21 22 23 24 25 26 27 19 20 21 22 23 24 25 16 17 18 19 20 21 22
28 29 30 26 27 28 29 30 31 23 24 25 26 27 28 29
30
July August September
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 6 1 2 3 1 2 3 4 5 6 7
7 8 9 10 11 12 13 4 5 6 7 8 9 10 8 9 10 11 12 13 14
14 15 16 17 18 19 20 11 12 13 14 15 16 17 15 16 17 18 19 20 21
21 22 23 24 25 26 27 18 19 20 21 22 23 24 22 23 24 25 26 27 28
28 29 30 31 25 26 27 28 29 30 31 29 30
October November December
Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa Su Mo Tu We Th Fr Sa
1 2 3 4 5 1 2 1 2 3 4 5 6 7
6 7 8 9 10 11 12 3 4 5 6 7 8 9 8 9 10 11 12 13 14
13 14 15 16 17 18 19 10 11 12 13 14 15 16 15 16 17 18 19 20 21
20 21 22 23 24 25 26 17 18 19 20 21 22 23 22 23 24 25 26 27 28
27 28 29 30 31 24 25 26 27 28 29 30 29 30 31
cal 2024: display the calendar of the year 2024.
cal 9 1752
   September 1752
Su Mo Tu We Th Fr Sa
       1  2 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
cal 9 1752: display the calendar for September 1752. This month is unusual because the British Empire switched from the Julian calendar to the Gregorian calendar in September 1752. By then the Julian calendar was 11 days behind the Gregorian calendar, so the 11 days from September 3 to September 13 were skipped and September 2 was followed directly by September 14.
date
Thu Jan 25 10:41:21 PST 2024
date: display the current date and time.
hostname
zhangjiyindeAir.lan
hostname: display the name of the host.
arch
arm64
arch: display the machine hardware name.
uname -a
Darwin zhangjiyindeAir.lan 21.6.0 Darwin Kernel Version 21.6.0: Thu Mar 9 20:10:19 PST 2023; root:xnu-8020.240.18.700.8~1/RELEASE_ARM64_T8101 arm64
uname -a: display all available system information, including the kernel name, host name, kernel release and version, and machine hardware name.
uptime: display the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
who am i
zhangjiy tty?? Jan 25 10:41
who am i: display the current user.
who
zhangjiyin console Jan 23 01:06
zhangjiyin ttys000 Jan 24 00:52
who: display the users who are currently logged in.
# w
w: display the users who are currently logged in and what they are doing.
Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)
The point of this exercise is (1) to get the book for free and (2) to see an example how a complicated project such as a book can be organized in a reproducible way.
For grading purpose, include a screenshot of Section 4.1.5 of the book here.
Answer:
I was also able to build git_book and epub_book but not pdf_book. Here is the screenshot of Section 4.1.5 of the git_book. Here is the screenshot of Section 4.1.5 of the epub_book.